Abstract

Haplotyping is the bioinformatics problem of predicting likely haplotypes based on given genotypes.
It can be approached using Gusfield's perfect phylogeny haplotyping (PPH) method, for which
polynomial-time and linear-time algorithms exist. These algorithms use sophisticated data structures or
transform the genotype data stepwise into haplotype data and, therefore, need a linear
amount of space. We are interested in the exact computational complexity of PPH and show that
it can be solved by an algorithm that needs only a logarithmic amount of space.
Together with the recently proved L-hardness of PPH, this establishes L-completeness. Our algorithm
relies on a new characterization of PPH in terms of bipartite graphs, which can be used both to
decide whether genotypes admit a perfect phylogeny and to construct one efficiently.

1 Introduction 

In human genetic variation studies, sequencing methods are applied that read out the genetic information 
at SNP (single nucleotide polymorphism) sites for multiple individuals. In order to be low-priced and 
feasible, these methods determine, for each site separately, the present bases, of which there can be two 
since the human DNA is arranged in pairs of chromosomes. For each individual in the variation study 
this yields a genotype that describes the bases at SNP sites. While the genotype says for every site which 
bases are present, it lacks the information on how the bases are assigned to the chromosomes of a pair. 
This information, which is described by haplotypes, is crucial to describe fine-grained genetic variation. 

The objective of haplotyping is to compensate for this drawback of genotype data by predicting
biologically reasonable haplotypes computationally. Gusfield [6] proposed an approach to haplotyping that
seeks haplotypes that are arrangeable in a perfect phylogenetic tree. He showed that this problem,
which will be called perfect phylogeny haplotyping (PPH), is solvable in polynomial time by a reduction
to the graph realization problem. Due to the practical importance of haplotyping, several groups also
proposed simpler polynomial-time [?] and linear-time [?] algorithms for Gusfield's approach.

In the present paper we study the space complexity of PPH. In [3] we showed that PPH is hard for
the complexity class L (deterministic logarithmic space) and lies in the counting class $\oplus$L (see that
paper for a wider discussion of the haplotyping issue and complexity-theoretic terms). The main open
problem of [3], namely whether PPH lies in the class L, is answered affirmatively by the present paper.
To prove this result, we present a graph-based characterization that extends ideas from Eskin, Halperin
and Karp [4]. Given a set of genotypes, they build, for each genotype separately, a graph whose
vertices represent sites and whose edges represent known relations between them. Based on these graphs, they
proved that the existence of a perfect phylogeny is related to the question of whether one can extend the
known relations such that all graphs become complete bipartite. Our characterization avoids the step
of guessing new relations between pairs of sites: We determine all relevant relations beforehand and
directly construct graphs that are bipartite if, and only if, there is a perfect phylogeny. Since the graph
construction can be described by first-order formulas and the problem of deciding whether a graph is
bipartite lies in L [8], we are able to prove the following theorem:

Theorem 1.1. PPH is complete for deterministic logarithmic space.

The paper is organized as follows: In Section 2 we provide a formal definition of the PPH problem
and of induced sets of pairs of sites. In Section 3 we prove our graph-based characterization and, after
that, in Section 4 we reduce PPH to the problem of whether an undirected graph is bipartite.

2 Perfect Phylogenies and Induced Sets 

Since only two different bases are present at the majority of SNP sites, it is convenient to code haplotypes
as strings over the alphabet $\{0,1\}$, where, for a given site, 0 stands for one of the bases that can be
observed in practice, while 1 encodes a second base that can also be observed. A genotype $g$ is a
sequence of sets that arises from a pair of haplotypes $h$ and $h'$ as follows: The $i$th set in the sequence $g$ is
$\{h[i], h'[i]\}$. However, it is customary to encode the set $\{0\}$ as 0, the set $\{1\}$ as 1, and $\{0,1\}$ as 2, so
that a genotype is actually a string over the alphabet $\{0,1,2\}$. For example, the two haplotypes 0100 and
0111 underlie (we also say explain) the genotype 0122; and so do 0101 and 0110. These haplotype pairs
differ in how the 2-entries at positions three and four are resolved. In the first case each haplotype has the
same entry at positions three and four; in the second case each haplotype has different entries there. This fact
can be generally stated as follows: If $h$ and $h'$ are explaining haplotypes for a genotype $g$ with 2-entries
in sites $i$ and $j$ ($g[i] = g[j] = 2$), then either $h[i] = h[j] \neq h'[i] = h'[j]$ or $h[i] = h'[j] \neq h'[i] = h[j]$ holds.
In the first case we say that $h$ and $h'$ resolve $g$ equally in $i$ and $j$, and in the second case we say that $h$ and
$h'$ resolve $g$ unequally in $i$ and $j$. To represent more than one haplotype, we arrange them in haplotype
matrices where each row is a haplotype and each column corresponds to a site. For genotypes, we use
genotype matrices. A $2n \times m$ haplotype matrix $B$ explains an $n \times m$ genotype matrix $A$ if, for each $i$, the
haplotypes in rows $2i-1$ and $2i$ of $B$ explain the genotype in row $i$ of $A$.
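The encoding above can be illustrated by a minimal Python sketch. It is not part of the original presentation; the function names `genotype_of` and `resolution` are hypothetical, haplotypes are represented as Python strings, and site positions are 0-based.

```python
def genotype_of(h1: str, h2: str) -> str:
    """Return the genotype explained by two haplotypes over {0, 1}."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    # Equal entries are copied; differing entries become a 2-entry.
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


def resolution(h1: str, h2: str, i: int, j: int) -> str:
    """Classify how h1 and h2 resolve the 2-entries at 0-based sites i and j."""
    if h1[i] == h2[i] or h1[j] == h2[j]:
        raise ValueError("both sites must be 2-entries of the genotype")
    # Equal resolution: each haplotype carries the same entry at i and j.
    return "equal" if h1[i] == h1[j] else "unequal"


print(genotype_of("0100", "0111"))       # 0122
print(genotype_of("0101", "0110"))       # 0122
print(resolution("0100", "0111", 2, 3))  # equal
print(resolution("0101", "0110", 2, 3))  # unequal
```

Both haplotype pairs from the example explain the genotype 0122, and they differ only in how the two 2-entries are resolved.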

We are interested in haplotypes that are arrangeable in a perfect phylogenetic tree. We say that a 
haplotype matrix B admits a perfect phylogeny if there exists a rooted tree T, such that: 

1. Each row of B labels exactly one node of T. 

2. Each column of B labels exactly one edge of T and each edge is labeled by at least one column. 

3. For every two rows $h$ and $h'$ of $B$ and every column $i$, we have $h[i] \neq h'[i]$ if, and only if, $i$ lies on
the path from $h$ to $h'$ in $T$.

A haplotype matrix $B$ admits a directed perfect phylogeny if $B$ together with the all-0 haplotype admits
a perfect phylogeny. The four gamete property is an alternative characterization of perfect phylogenies,
observed by many authors (see [?] for references). It depends on a certain relation between pairs of
columns: The induced set $\mathrm{ind}^B(i,j)$ of two columns $i$ and $j$ in a haplotype matrix $B$ contains all strings
from $\{00,01,10,11\}$ that appear in the columns $i$ and $j$. The four gamete property then says that a
haplotype matrix $B$ admits a perfect phylogeny if, and only if, for each pair of columns $i$ and $j$ we have
$\{00,01,10,11\} \not\subseteq \mathrm{ind}^B(i,j)$. Carried over to the directed case, we know that $B$ admits a directed perfect
phylogeny if, and only if, for each pair of columns $i$ and $j$ we have $\{01,10,11\} \not\subseteq \mathrm{ind}^B(i,j)$. We refer to
this as the three gamete property in the following.
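As an illustration, the two gamete properties can be tested directly from their definitions. The following Python sketch is an assumption-laden example rather than a method taken from the literature: the names `ind_B`, `four_gamete` and `three_gamete` are hypothetical, haplotype matrices are lists of strings, and columns are 0-based.

```python
from itertools import combinations


def ind_B(B, i, j):
    """Induced set of the 0-based columns i and j of haplotype matrix B."""
    return {row[i] + row[j] for row in B}


def four_gamete(B):
    """True if no column pair induces all four strings 00, 01, 10, 11."""
    m = len(B[0])
    return all(not {"00", "01", "10", "11"} <= ind_B(B, i, j)
               for i, j in combinations(range(m), 2))


def three_gamete(B):
    """True if no column pair induces all three strings 01, 10, 11."""
    m = len(B[0])
    return all(not {"01", "10", "11"} <= ind_B(B, i, j)
               for i, j in combinations(range(m), 2))


B = ["0101", "0110", "0111"]
print(four_gamete(B))             # True
print(three_gamete(B))            # False
print(four_gamete(B + ["0000"]))  # False
```

The last two lines agree with the definition of the directed case: the matrix fails the three gamete property exactly because adding the all-0 haplotype breaks the four gamete property.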

We say that a genotype matrix $A$ admits a (directed) perfect phylogeny if there exists an explaining
haplotype matrix for it that admits a (directed) perfect phylogeny or, equivalently, satisfies the
four (three) gamete property. The perfect phylogeny haplotyping problem (PPH) contains exactly the
genotype matrices that admit a perfect phylogeny. Similarly, the directed perfect phylogeny haplotyping
problem (DPPH) contains exactly the genotype matrices that admit a directed perfect phylogeny. The
problems PPH and DPPH are closely related through first-order reductions: For a reduction from DPPH
to PPH it suffices to append the all-0 genotype to a given genotype matrix, and for the converse direction
we can use a reduction from Eskin, Halperin and Karp [4]: In every column where a 1-entry appears
before a 0-entry, substitute all 1-entries by 0-entries and all 0-entries by 1-entries. For convenience we
restrict ourselves to directed perfect phylogenies in the rest of this section and in Section 3. We come back
to undirected perfect phylogenies in Section 4.

A genotype matrix determines, to a certain extent, the induced sets of explaining haplotype matrices.
This is formalized by the notion of induced sets for genotype matrices in [4]: For a genotype matrix $A$
and two columns $i$ and $j$, the set $\mathrm{ind}^A(i,j)$ contains a string $xy \in \{00,01,10,11\}$ whenever $A$ has a
genotype $g$ with either $g[i] = x$ and $g[j] = y$, or $g[i] = x$ and $g[j] = 2$, or $g[i] = 2$ and $g[j] = y$. From
this definition it follows that we have $\mathrm{ind}^A(i,j) \subseteq \mathrm{ind}^B(i,j)$ for any haplotype matrix $B$ explaining $A$, and
$\mathrm{ind}^A(i,j) = \mathrm{ind}^B(i,j)$ if $A$ does not contain a genotype with 2-entries in both $i$ and $j$. We also know that
$A$ does not admit a directed perfect phylogeny whenever $\{01,10,11\} \subseteq \mathrm{ind}^A(i,j)$ holds.
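The induced sets of a genotype matrix can be computed by a single pass over its rows. The sketch below is a hedged illustration in Python; the names `ind_A` and `has_forbidden_pair` are hypothetical, genotypes are strings over 0, 1, 2, and columns are 0-based. The example matrix is the one printed in Figure 1.

```python
from itertools import combinations


def ind_A(A, i, j):
    """Induced set of the 0-based columns i and j of genotype matrix A."""
    out = set()
    for g in A:
        x, y = g[i], g[j]
        if x == "2" and y == "2":
            continue  # two 2-entries do not force any string
        xs = "01" if x == "2" else x  # a single 2-entry stands for 0 and 1
        ys = "01" if y == "2" else y
        out |= {a + b for a in xs for b in ys}
    return out


def has_forbidden_pair(A):
    """True if some column pair already induces 01, 10 and 11."""
    m = len(A[0])
    return any({"01", "10", "11"} <= ind_A(A, i, j)
               for i, j in combinations(range(m), 2))


A = ["0001010", "1220000", "2222000", "1200200", "0002020", "0002022"]
print(sorted(ind_A(A, 0, 1)))   # columns 1 and 2 of the paper's numbering
print(has_forbidden_pair(A))
```

If `has_forbidden_pair` returns `True`, the matrix cannot admit a directed perfect phylogeny; a result of `False` alone does not decide the question.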

If we consider only explaining haplotype matrices that satisfy the three gamete property, we can
infer some information about the resolution of 2-entries: Let $A$ be a genotype matrix and $B$ an explaining
haplotype matrix for $A$ that satisfies the three gamete property. Whenever we have $\{01,10\} \subseteq \mathrm{ind}^A(i,j)$
for columns $i$ and $j$, we know that every genotype $g$ of $A$ with $g[i] = g[j] = 2$ is resolved unequally in $i$
and $j$ by its haplotypes from $B$. Whenever we have $\{11\} \subseteq \mathrm{ind}^A(i,j)$, the genotypes are resolved equally
in $i$ and $j$. We can also infer resolutions through genotypes with at least three 2-entries: Consider a
genotype $g$ from $A$ and columns $i$, $j$ and $k$ with $g[i] = g[j] = g[k] = 2$ and induced sets $\{11\} \subseteq \mathrm{ind}^A(i,j)$
and $\{01,10\} \subseteq \mathrm{ind}^A(j,k)$. Let $h$ and $h'$ be the explaining haplotypes for $g$ from $B$. We know from the
induced sets that $h[i] = h[j] \neq h'[i] = h'[j]$ and $h[j] = h'[k] \neq h'[j] = h[k]$ hold. This implies $h[i] = h'[k] \neq
h'[i] = h[k]$, that is, $g$ is resolved unequally in $i$ and $k$. By the three gamete property, every genotype with
2-entries in $i$ and $k$ must then be resolved unequally by its haplotypes in these columns. Note that we do not
learn the resolution in columns $i$ and $k$ from the induced set of this column pair. We deduced the resolution
through three 2-entries in $g$ by using information about the induced sets $\mathrm{ind}^A(i,j)$ and $\mathrm{ind}^A(j,k)$.
The derived resolution in columns $i$ and $k$ may, again, trigger a resolution in another pair of columns $k$
and $l$ through a genotype with 2-entries in $i$, $k$ and $l$. In this way resolutions may propagate through
column pairs of the whole genotype matrix. This can possibly end up with a column pair where one genotype
is forced to be resolved equally while another is already resolved unequally. In this case $A$ does not admit a
perfect phylogeny. In the next section we describe graphs that represent resolutions in column pairs and their
propagation through the genotype matrix.
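The propagation step behaves like addition modulo 2 if an equal resolution is encoded as 0 and an unequal one as 1. The short Python sketch below checks this exhaustively for three 2-entries; it is an illustration only, and the helper names `unequal` and `check_propagation` are hypothetical.

```python
from itertools import product


def unequal(h, i, j):
    """Return 1 if haplotype h has different entries at sites i and j, else 0."""
    return int(h[i] != h[j])


def check_propagation():
    """Check that the resolution of (i, k) is the XOR of (i, j) and (j, k)."""
    i, j, k = 0, 1, 2
    # All three sites are 2-entries, so the second haplotype is the complement
    # of the first and the resolution of a site pair can be read from h alone.
    for h in product("01", repeat=3):
        if unequal(h, i, k) != unequal(h, i, j) ^ unequal(h, j, k):
            return False
    return True


print(check_propagation())
```

In particular, an equal resolution in one column pair combined with an unequal resolution in an adjacent pair yields an unequal resolution, and two unequal resolutions yield an equal one.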



3 A Graph-Based Characterization for Perfect Phylogeny Haplotyping 

We present a new characterization for DPPH in terms of undirected edge-weighted graphs. The graphs
are used to represent resolutions in column pairs and their propagation in the genotype matrix. Therefore,
we call these graphs resolution graphs. The vertices of each resolution graph are identified with columns
of a given genotype matrix and the edges are weighted by 0 or 1. An edge with weight 0 between two
vertices $k$ and $l$ indicates that all genotypes must be resolved equally in columns $k$ and $l$. Similarly, an edge
with weight 1 indicates that the columns are resolved unequally. Our characterization in Lemma 3.1 says
that the absence of odd-weight cycles (the weight of a path or a cycle is the sum of its edge weights)
from the resolution graphs, together with a condition on the induced sets, is equivalent to the existence of a
directed perfect phylogeny.

Given an $n \times m$ genotype matrix $A$, we build $m$ resolution graphs, a single graph $G_i$ for every
column $i$. A graph $G_i$ describes resolutions for a particular set $A_i$ of genotypes from $A$, and we define the
sets $A_i$ such that they are pairwise disjoint. If one wants to determine the haplotypes for a particular
genotype $g$, it suffices to consider the unique graph $G_i$ with $g \in A_i$. To assign the genotypes to the
sets $A_i$, we use a partial order on genotype matrix columns from [4]: A column with index $i$ is greater
than a column with index $j$ (denoted by $i \succ j$) if $\mathrm{ind}^A(i,j) \subseteq \{00,10,11\}$ and the column vectors are
not the same (see Figure 1 for an example). Besides this partial order, we also use the total order that is
given by the indices of the columns of $A$. For every $i \in \{1,\dots,m\}$, the set $A_i$ is then defined as follows:

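$$
\begin{aligned}
A_i = \{\, \text{genotype } g \text{ of } A \mid{} & g[i] = 2, \\
& \forall j \neq i \; (g[j] = 2 \Rightarrow \text{not } j \succ i), \text{ and} \\
& \forall j \neq i \; \big( (g[j] = 2 \text{ and } \forall k \neq j \; (g[k] = 2 \Rightarrow \text{not } k \succ j)) \Rightarrow j > i \big) \,\}.
\end{aligned}
$$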

This definition assigns every genotype with a 2-entry to exactly one set $A_i$. Genotypes without 2-entries
do not need any resolution and, therefore, they are not assigned to any set. Figure 1 shows an example
of a genotype matrix $A$ and its sets $A_i$.
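The assignment of genotypes to the sets can be sketched in Python as follows. This is an illustrative reading of the definition, not code from the original work: the names `ind_A`, `greater` and `assign_sets` are hypothetical, columns are 0-based, and the sketch assumes that the column relation is indeed a partial order, so that every genotype with a 2-entry has a maximal column.

```python
def ind_A(A, i, j):
    """Induced set of the 0-based columns i and j of genotype matrix A."""
    out = set()
    for g in A:
        x, y = g[i], g[j]
        if x == "2" and y == "2":
            continue  # two 2-entries do not force any string
        xs = "01" if x == "2" else x
        ys = "01" if y == "2" else y
        out |= {a + b for a in xs for b in ys}
    return out


def greater(A, i, j):
    """True if column i is greater than column j in the partial order."""
    differ = any(g[i] != g[j] for g in A)
    return differ and ind_A(A, i, j) <= {"00", "10", "11"}


def assign_sets(A):
    """Map every column index i to the list of genotypes in the set A_i."""
    m = len(A[0])
    sets = {i: [] for i in range(m)}
    for g in A:
        twos = [c for c in range(m) if g[c] == "2"]
        if not twos:
            continue  # genotypes without 2-entries need no resolution
        # Maximal columns among the 2-entry columns of g.
        maximal = [c for c in twos
                   if not any(greater(A, d, c) for d in twos if d != c)]
        # The maximal column with the lowest index receives the genotype.
        sets[min(maximal)].append(g)
    return sets


A = ["0001010", "1220000", "2222000", "1200200", "0002020", "0002022"]
for i, genotypes in assign_sets(A).items():
    print(f"A_{i + 1} = {genotypes}")
```

For the matrix of Figure 1, only the first, second and fourth sets are non-empty.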

For every $i \in \{1,\dots,m\}$, we now define the resolution graph $G_i$ (again, see Figure 1 for an example).
As already stated, the vertices $V_i$ of $G_i$ are identified with columns from $A$. A vertex with index $k$ lies in
$V_i$ if, and only if, $A_i$ contains a genotype with a 2-entry in column $k$. The edges $E_i \subseteq \{e \subseteq V_i \mid |e| = 2\}$
and their weights $w_i \colon E_i \to \{0,1\}$ are constructed as follows:

$\{k,l\} \in E_i$ and $w_i(\{k,l\}) = 0$ if, and only if, there exists $g_1 \in A_i$ with $g_1[k] = g_1[l] = 2$ and

(a) $11 \in \mathrm{ind}^A(k,l)$, or

(b) there is a column $j \neq i$ and $g_2 \in A_j$ with $g_2[k] = g_2[l] = 2$.

$\{k,l\} \in E_i$ and $w_i(\{k,l\}) = 1$ if, and only if, there exists $g_1 \in A_i$ with $g_1[k] = g_1[l] = 2$

and $\{01,10\} \subseteq \mathrm{ind}^A(k,l)$.
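The edge rules can be turned into a small Python sketch. It is a hedged illustration under several assumptions: all function names are hypothetical, columns are 0-based, and an edge is stored as a triple `(k, l, w)`, so that a column pair satisfying both the weight-0 and the weight-1 rule simply receives two parallel edges (which together form an odd-weight cycle).

```python
from itertools import combinations


def ind_A(A, i, j):
    """Induced set of the 0-based columns i and j of genotype matrix A."""
    out = set()
    for g in A:
        x, y = g[i], g[j]
        if x == "2" and y == "2":
            continue
        xs = "01" if x == "2" else x
        ys = "01" if y == "2" else y
        out |= {a + b for a in xs for b in ys}
    return out


def greater(A, i, j):
    """True if column i is greater than column j in the partial order."""
    return any(g[i] != g[j] for g in A) and ind_A(A, i, j) <= {"00", "10", "11"}


def assign_sets(A):
    """Map every column index i to the list of genotypes in the set A_i."""
    m = len(A[0])
    sets = {i: [] for i in range(m)}
    for g in A:
        twos = [c for c in range(m) if g[c] == "2"]
        if twos:
            maximal = [c for c in twos
                       if not any(greater(A, d, c) for d in twos if d != c)]
            sets[min(maximal)].append(g)
    return sets


def resolution_graphs(A):
    """Map every column i to the set of weighted edges (k, l, w) of G_i."""
    m = len(A[0])
    sets = assign_sets(A)
    graphs = {}
    for i in range(m):
        edges = set()
        for g1 in sets[i]:
            twos = [c for c in range(m) if g1[c] == "2"]
            for k, l in combinations(twos, 2):
                ind = ind_A(A, k, l)
                # Rule (b): a genotype of another set has 2-entries in k and l.
                elsewhere = any(g2[k] == "2" and g2[l] == "2"
                                for j in range(m) if j != i
                                for g2 in sets[j])
                if "11" in ind or elsewhere:
                    edges.add((k, l, 0))
                if {"01", "10"} <= ind:
                    edges.add((k, l, 1))
        graphs[i] = edges
    return graphs


A = ["0001010", "1220000", "2222000", "1200200", "0002020", "0002022"]
for i, edges in resolution_graphs(A).items():
    print(i + 1, sorted(edges))
```

Vertices without incident edges, such as a 2-entry column that is related to no other column by either rule, are not listed by this representation and have to be tracked separately if they are needed.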

The characterization for DPPH is as follows: 

Lemma 3.1. An $n \times m$ genotype matrix $A$ admits a directed perfect phylogeny if, and only if, for each
pair $i, j \in \{1,\dots,m\}$ we have $\{01,10,11\} \not\subseteq \mathrm{ind}^A(i,j)$, and for each $i \in \{1,\dots,m\}$, $G_i$ does not contain
an odd-weight cycle.
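The second condition of the lemma can be tested by propagating path parities through a graph. The Python sketch below uses a plain breadth-first search, which needs linear space; it is meant only to illustrate the condition and is not the logarithmic-space procedure this paper is about. The function name `has_odd_weight_cycle` and the edge format `(u, v, w)` are assumptions of this example.

```python
from collections import defaultdict, deque


def has_odd_weight_cycle(edges):
    """Check a graph given as (u, v, w) triples, w in {0, 1}, for an odd-weight cycle.

    Parallel edges are allowed; two parallel edges of different weight
    already form an odd-weight cycle.
    """
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    parity = {}  # parity of the weight of some path from the component's start
    for start in list(adj):
        if start in parity:
            continue
        parity[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v, w in adj[u]:
                if v not in parity:
                    parity[v] = parity[u] ^ w
                    queue.append(v)
                elif parity[v] != parity[u] ^ w:
                    return True  # two paths of different parity close an odd cycle
    return False


# Edges of the first resolution graph of the Figure 1 matrix (0-based columns).
G1 = [(0, 1, 0), (0, 2, 0), (0, 3, 1), (1, 2, 0), (1, 3, 1), (2, 3, 1)]
print(has_odd_weight_cycle(G1))
print(has_odd_weight_cycle(G1 + [(0, 1, 1)]))
```

The second call adds a conflicting weight-1 edge between two vertices that are already joined by a weight-0 edge, which is the kind of conflict described at the end of Section 2.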

Proof. Only-if-part: Let $A$ be a genotype matrix and $B$ a haplotype matrix for it that satisfies the three
gamete property. Thus, for every pair of columns $i$ and $j$, we have $\{01,10,11\} \not\subseteq \mathrm{ind}^B(i,j)$ and, therefore,
$\{01,10,11\} \not\subseteq \mathrm{ind}^A(i,j)$. To prove that none of the resolution graphs has an odd-weight cycle, we first
show the following property:

Claim. Let $A$ be a genotype matrix and $B$ a haplotype matrix for it that satisfies the three gamete
property. Let $i$, $k$ and $l$ be columns and $g$ a genotype with $g \in A_i$ and $g[k] = g[l] = 2$. If $G_i$ contains an
edge with weight 1 between $k$ and $l$, then $B$ resolves $g$ unequally in $k$ and $l$. If the weight is 0, then the
resolution is equal.

Proof. If there is a 1-weighted edge between $k$ and $l$, we have $\{01,10\} \subseteq \mathrm{ind}^A(k,l)$ and, therefore, $g$
must be resolved unequally in $k$ and $l$. If there is a 0-weighted edge between $k$ and $l$, it is constructed
for one of two reasons: Whenever $11 \in \mathrm{ind}^A(k,l)$, we know by the three gamete property that $g$ must
be resolved equally in $k$ and $l$. We are left with the case that there is another column $j$ and a genotype
$g_2 \in A_j$ with $g_2[k] = g_2[l] = 2$. In this case, the following matrix shows what we know about the entries
of $g_1 = g$ and $g_2$, where $g_1[j]$ and $g_2[i]$ are values from $\{0,1,2\}$:



$$
\begin{array}{c|cccc}
    & j      & i      & k & l \\ \hline
g_1 & g_1[j] & 2      & 2 & 2 \\
g_2 & 2      & g_2[i] & 2 & 2
\end{array}
$$
[Figure 1. The drawings of the partial order on the columns of $A$ and of the resolution graphs of
columns 1, 2 and 4 with edge weights are not recoverable from the extracted text.]

Genotype matrix $A$ (columns 1 to 7):

    0001010
    1220000
    2222000
    1200200
    0002020
    0002022

Assignment of genotypes to sets $A_i$: $A_1 = \{2222000\}$, $A_2 = \{1220000, 1200200\}$,
$A_4 = \{0002020, 0002022\}$, and $A_3 = A_5 = A_6 = A_7 = \emptyset$.

Figure 1: For a genotype matrix $A$ with seven columns, this figure shows the corresponding partial
order on the columns of $A$, the assignment of genotypes to sets $A_i$, and the construction of resolution
graphs $G_i$. Some sets $A_i$ are empty and, therefore, the corresponding resolution graphs are also empty.
A directed edge in the partial order from a vertex with index $k$ to a vertex with index $l$ means $k \succ l$.

By definition, a genotype is contained in a set $A_i$ if column $i$ is the maximal column (with respect to $\succ$)
with the lowest index among all columns with a 2-entry in the genotype. Since $i$ and $j$ are not the same,
this implies that at least one of the entries $g_1[j]$ and $g_2[i]$ does not equal 2. We distinguish between the
possible values for them (possible values are $g_2[i] = 1$, $g_2[i] = 0$, $g_1[j] = 1$ and $g_1[j] = 0$) and show that
we have $h_1[k] = h_1[l] \neq h'_1[k] = h'_1[l]$ for the explaining haplotypes $h_1$ and $h'_1$ of $g_1$:

Case $g_2[i] = 1$: We know $11 \in \mathrm{ind}^A(i,k)$, $11 \in \mathrm{ind}^A(i,l)$ and, therefore, $h_1[i] = h_1[k] \neq h'_1[i] = h'_1[k]$
and $h_1[i] = h_1[l] \neq h'_1[i] = h'_1[l]$, which implies $h_1[k] = h_1[l] \neq h'_1[k] = h'_1[l]$.

Case $g_2[i] = 0$: By assumption column $k$ is not greater than column $i$, which implies, together with
the fact that they are not the same, $10 \in \mathrm{ind}^A(i,k)$. Similarly, we have $10 \in \mathrm{ind}^A(i,l)$ for columns $i$
and $l$. Furthermore, the 0-entry in $g_2[i]$ and the 2-entries in $g_2[k]$ and $g_2[l]$ ensure that $01 \in \mathrm{ind}^A(i,k)$ and
$01 \in \mathrm{ind}^A(i,l)$. The resulting induced sets force an unequal resolution of $g_1$ in both $i$ and $k$, and $i$ and $l$.
Thus $h_1[i] = h'_1[k] \neq h'_1[i] = h_1[k]$, $h_1[i] = h'_1[l] \neq h'_1[i] = h_1[l]$ and, therefore, $h_1[k] = h_1[l] \neq h'_1[k] = h'_1[l]$.

The cases $g_1[j] = 1$ and $g_1[j] = 0$ are similar to the cases $g_2[i] = 1$ and $g_2[i] = 0$, respectively. Thus, we
proved the claim. □

We assume, for the sake of contradiction, that there is a column $i$ such that $G_i$ contains an odd-weight
cycle $i_1, e_1, i_2, \dots, i_p, e_p, i_{p+1} = i_1$, where the $i_j$ are column indices and the $e_j$ are weighted edges from
$E_i$. For every edge $e_j$ let $g_j \in A_i$ be a genotype with $g_j[i_j] = g_j[i_{j+1}] = 2$ and let $h_j$ and $h'_j$ be the
explaining haplotypes for $g_j$. The above claim and the discussion about induced sets in Section 2 imply
the following two properties: First, if $h_j$ and $h'_j$ resolve $g_j$ in columns $i$ and $i_j$ equally, then they resolve
$g_j$ in columns $i$ and $i_{j+1}$ equally if $e_j$ has weight 0, and unequally if the weight is 1. Second, if $h_j$
and $h'_j$ resolve $g_j$ in columns $i$ and $i_j$ unequally, they resolve $g_j$ in columns $i$ and $i_{j+1}$ equally if $e_j$ has
weight 1, and unequally if the weight is 0. Thus, when an edge $e_j$ has weight 1, the resolution alternates
between the column pair $i$ and $i_j$ and the column pair $i$ and $i_{j+1}$. The resolution does not change if the
weight is 0. Now assume that $g_1$ is resolved equally by $h_1$ and $h'_1$ in columns $i$ and $i_1$. From the above
properties it follows that for every $j \in \{1,\dots,p\}$, the genotype $g_j$ is resolved equally by its haplotypes in
columns $i$ and $i_{j+1}$ if the path $i_1, e_1, i_2, \dots, i_j, e_j, i_{j+1}$ has an even weight, and unequally otherwise. If we
consider the whole odd-weight cycle, this fact yields a contradiction to the resolution of $g_1$ in $i$ and $i_1$.
The assumption that $g_1$ is resolved unequally in columns $i$ and $i_1$ is contradictory in the same way.

If-part: Let $A$ be a genotype matrix such that for every $i$ and $j$ we have $\{01,10,11\} \not\subseteq \mathrm{ind}^A(i,j)$ and
no resolution graph has an odd-weight cycle. We construct an explaining haplotype matrix $B$ for $A$ and
show that it satisfies the three gamete property.

Construction: For genotypes without 2-entries, the explaining haplotypes are simply copies of them.
The other genotypes (with 2-entries) are partitioned by the sets $A_1, \dots, A_m$, and we treat every set $A_i$ and
its graph $G_i$ separately. First we transform $G_i$ into a connected graph $G'_i$: From every component that
does not contain $i$, we select a vertex and connect it to $i$ by an edge of weight 0. Since $G_i$ does not
contain odd-weight cycles, each vertex is connected to $i$ by either only even-weight paths or only
odd-weight paths in $G'_i$. For each genotype $g \in A_i$, we construct two explaining haplotypes $h$ and $h'$ as
follows: For every column $j$ with $g[j] = 2$, we set $h[j] = 0$ and $h'[j] = 1$ if there is an even-weight path
between $i$ and $j$ in $G'_i$, and $h[j] = 1$ and $h'[j] = 0$ otherwise. The 0-entries and 1-entries are copied to
both haplotypes.
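The construction can be sketched in Python for a single set $A_i$ and its graph. This is an illustration under stated assumptions: the names `path_parities` and `explain` are hypothetical, columns are 0-based, the graph is given as a vertex set plus `(u, v, w)` edge triples, and the graph is assumed to contain no odd-weight cycle. Starting a new search with parity 0 in a component has the same effect as joining the selected vertex to $i$ by an edge of weight 0.

```python
from collections import defaultdict, deque


def path_parities(i, vertices, edges):
    """Parity of the path weights from column i to every vertex of G'_i."""
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    parity = {}
    # Column i first, then one start vertex per remaining component.
    for start in [i] + sorted(vertices):
        if start in parity:
            continue
        parity[start] = 0  # same as a weight-0 edge between start and i
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v, w in adj[u]:
                if v not in parity:
                    parity[v] = parity[u] ^ w
                    queue.append(v)
    return parity


def explain(g, parity):
    """Build the two haplotypes for genotype g from the path parities."""
    h1 = "".join(str(parity[c]) if x == "2" else x for c, x in enumerate(g))
    h2 = "".join(str(1 - parity[c]) if x == "2" else x for c, x in enumerate(g))
    return h1, h2


# First resolution graph of the Figure 1 matrix (0-based columns).
G1 = [(0, 1, 0), (0, 2, 0), (0, 3, 1), (1, 2, 0), (1, 3, 1), (2, 3, 1)]
parity = path_parities(0, {0, 1, 2, 3}, G1)
print(explain("2222000", parity))
```

For the genotype 2222000 the sketch yields the haplotypes 0001000 and 1110000, which indeed explain it.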

Three gamete property: We are left to prove that $\{01,10,11\} \not\subseteq \mathrm{ind}^B(k,l)$ holds for every column
pair $k$ and $l$. If there is no genotype $g$ with 2-entries in $k$ and $l$, we simply have $\mathrm{ind}^B(k,l) = \mathrm{ind}^A(k,l)$
and the three gamete property holds for these columns by assumption. Consider two columns $k$ and $l$ that
have 2-entries in a common genotype. We distinguish whether $11 \in \mathrm{ind}^A(k,l)$, $\{01,10\} \subseteq \mathrm{ind}^A(k,l)$, or
neither of them holds, and show that $\{01,10,11\} \not\subseteq \mathrm{ind}^B(k,l)$ is true in all cases.

1. Let $11 \in \mathrm{ind}^A(k,l)$, $g$ a genotype with $g[k] = g[l] = 2$, and $g \in A_i$ for a column $i$. We know that
there is an edge $\{k,l\}$ of weight 0 in $G'_i$ and, therefore, either all paths from $i$ to $k$ and all paths from $i$ to $l$
have an even weight, or all paths from $i$ to $k$ and all paths from $i$ to $l$ have an odd weight. The
construction of the haplotypes implies that $g$ is resolved equally in $k$ and $l$. Thus, the induced set
in columns $k$ and $l$ is extended only by the strings 00 and 11, which does not conflict with the
three gamete property.

2. Let $\{01,10\} \subseteq \mathrm{ind}^A(k,l)$, $g$ a genotype with $g[k] = g[l] = 2$, and $i$ a column with $g \in A_i$. This implies that
there is an edge $\{k,l\}$ of weight 1 in $G'_i$ and, therefore, either all paths from $i$ to $k$ have an even weight and
all paths from $i$ to $l$ have an odd weight or, conversely, all paths from $i$ to $k$ have an odd weight
and all paths from $i$ to $l$ have an even weight. Thus, $g$ is resolved unequally by its haplotypes in $k$
and $l$, which yields only the additional induced strings 01 and 10.

3. Assume that neither $11 \in \mathrm{ind}^A(k,l)$ nor $\{01,10\} \subseteq \mathrm{ind}^A(k,l)$ holds. If the genotypes with 2-entries
in columns $k$ and $l$ lie in the same set $A_i$, then they are all resolved equally or all resolved unequally
in $k$ and $l$, since their resolution depends on the same graph $G'_i$. If the genotypes with 2-entries in $k$
and $l$ are distributed among multiple sets $A_i$, the corresponding resolution graphs contain edges of
weight 0 between $k$ and $l$ by definition. Similar to the first case, the construction ensures that
these genotypes are resolved equally in $k$ and $l$.

In all cases we proved that the resolutions of column pairs (and the resulting extension of induced sets)
do not conflict with the three gamete property. □



4 Reduction from PPH to the Bipartition Problem 

We use our characterization from the last section to show that PPH can be reduced to the question of
whether an undirected graph is bipartite. Bipartite graphs are characterized by the absence of an
odd-length cycle, and the formal BIPARTITION problem contains exactly the graphs with this property.

Lemma 4.1. PPH reduces to BIPARTITION via first-order reductions.

Proof. Construction: The reduction procedure consists of three steps: For a given genotype matrix $A$,
we first apply the reduction from PPH to DPPH from [4], which is described in Section 2. This yields a
genotype matrix $A'$. Then we construct a graph $G$ that is the disjoint union of all resolution graphs of the
columns of $A'$. In the last step, we substitute each 0-weighted edge in $G$ by a path of length 2 and, finally,
delete all edge weights. This yields the graph $G'$.

Correctness: First, $A$ admits a perfect phylogeny exactly if $A'$ admits a directed perfect phylogeny.
Furthermore, we know from Lemma 3.1 that $A'$ admits a directed perfect phylogeny if, and only if, its
induced sets satisfy $\{01,10,11\} \not\subseteq \mathrm{ind}^{A'}(i,j)$ for all column pairs, a condition that can be checked by a
first-order formula, and $G$ does not contain an odd-weight cycle. The insertion of paths of length 2
transforms even-weight paths into even-length paths and odd-weight paths into odd-length paths. Therefore, $G$
does not contain an odd-weight cycle if, and only if, $G'$ is bipartite, which proves the lemma. □
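The last step of the construction can be illustrated by a short Python sketch. It is an example under stated assumptions, not the first-order reduction itself: the names `to_unweighted` and `is_bipartite` are hypothetical, vertices of the disjoint union are tagged as pairs (graph index, column), fresh middle vertices are named `("mid", n)`, and bipartiteness is tested by an ordinary breadth-first 2-coloring that uses linear space.

```python
from collections import defaultdict, deque


def to_unweighted(edges):
    """Replace every weight-0 edge by a path of length 2 and drop the weights."""
    out = []
    for n, (u, v, w) in enumerate(edges):
        if w == 0:
            mid = ("mid", n)  # fresh vertex subdividing the edge
            out += [(u, mid), (mid, v)]
        else:
            out.append((u, v))
    return out


def is_bipartite(edges):
    """2-color an undirected graph given as a list of vertex pairs."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    color = {}
    for start in list(adj):
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    return False  # an odd-length cycle was closed
    return True


# Disjoint union of the first two resolution graphs of the Figure 1 matrix.
G1 = [(0, 1, 0), (0, 2, 0), (0, 3, 1), (1, 2, 0), (1, 3, 1), (2, 3, 1)]
G = [((0, k), (0, l), w) for k, l, w in G1] + [((1, 1), (1, 2), 0)]
print(is_bipartite(to_unweighted(G)))
```

Adding a weight-1 edge parallel to an existing weight-0 edge creates a triangle after subdivision, so the resulting graph is no longer bipartite.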

Since BIPARTITION $\in$ L [8] and PPH is L-hard [3], the reduction gives the last step to prove
Theorem 1.1 from the introduction.



5 Conclusion 

In this paper we settled the main open problem from [3] and showed that perfect phylogeny haplotyping
is solvable in deterministic logarithmic space and, therefore, is also L-complete (a hardness proof can
be found in [3]). We introduced a characterization of PPH in terms of resolution graphs that represent
resolutions of 2-entries in genotypes. We proved that the question of whether there are resolutions
that conflict with the existence of perfect phylogenies is closely related to the bipartition problem for
undirected graphs. This yields a reduction from PPH to the bipartition problem, which can be seen as a
conceptually easy and efficient approach to decide and construct perfect phylogenies for genotype data.